ATOM SaaS Production Runbook
**Last Updated:** 2026-02-22
**Platform Version:** v2.3
**Environment:** Production (ATOM Cloud)
**Target Audience:** DevOps Engineers, Site Reliability Engineers, On-Call Engineers
---
Table of Contents
- Overview
- Production Architecture
- Deployment Procedures
- Monitoring & Observability
- Incident Response
- Common Issues & Resolutions
- Data Backup & Recovery
- Security & Compliance
- Maintenance Windows
- Emergency Contacts
---
Overview
Platform Components
**ATOM SaaS** is a multi-tenant AI agent platform deployed on ATOM Cloud with the following components:
- **web-platform** - Main production app (Next.js + Python backend)
  - URL: https://[tenant].atomagentos.com
  - Components: Next.js frontend (port 3000) + Python FastAPI backend (port 8000)
  - Resources: 1GB RAM, 1 CPU (shared)
  - Nodes: 1 minimum (auto-scaling enabled)
- **api-service** - Dedicated Python backend API
  - URL: https://[tenant].atomagentos.com/api
  - Components: Python FastAPI only (port 8000)
  - Resources: 1GB RAM, 1 CPU (shared)
  - Nodes: 2 (rolling deployments)
- **Database** - Neon PostgreSQL
  - Managed database service with connection pooling
  - Automatic backups and point-in-time recovery
- **Redis** - Upstash Redis
  - URL format: https://*.upstash.io
  - Used for: rate limiting, caching, session storage
- **Storage** - AWS S3
  - Tenant-isolated storage (s3://atom-saas/{tenant_id}/)
  - Used for: file uploads, canvas assets, agent artifacts
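The tenant-isolated storage layout above could be enforced in application code roughly as follows. This is a minimal sketch: the function name and validation rules are assumptions made for illustration, not the platform's actual implementation. Only the key layout ({tenant_id}/ prefix inside the atom-saas bucket) comes from this runbook.

```python
from pathlib import PurePosixPath

BUCKET = "atom-saas"  # bucket name taken from the storage path listed above


def tenant_object_key(tenant_id: str, relative_path: str) -> str:
    """Return an S3 object key that stays inside the tenant's prefix.

    Hypothetical helper: rejects absolute paths and '..' segments so one
    tenant cannot address another tenant's objects.
    """
    if not tenant_id or "/" in tenant_id or tenant_id in {".", ".."}:
        raise ValueError("tenant_id must be a single non-empty path segment")
    parts = PurePosixPath(relative_path).parts
    if not parts or relative_path.startswith("/") or ".." in parts:
        raise ValueError(f"unsafe object path: {relative_path!r}")
    return f"{tenant_id}/{'/'.join(parts)}"


if __name__ == "__main__":
    key = tenant_object_key("tenant-abc", "uploads/file.pdf")
    print(f"s3://{BUCKET}/{key}")
```

Building every key through one helper like this keeps the isolation rule in a single place, which makes it easier to audit than ad hoc string formatting.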
Key Technologies
- **Frontend:** Next.js 14, React 18, TypeScript, Tailwind CSS
- **Backend:** Python 3.11+, FastAPI, SQLAlchemy, Alembic
- **Database:** PostgreSQL with Row-Level Security (RLS)
- **Deployment:** ATOM Cloud with Docker containers
- **Monitoring:** Cloud metrics, logs, health checks
---
Production Architecture
Application Deployment Strategy
The platform uses a **dual-app deployment strategy** to separate web and AI workloads:
┌─────────────────────────────────────────────────────────────┐
│                     web-platform (Main)                     │
│  ┌──────────────────┐      ┌──────────────────────┐         │
│  │ Next.js          │      │ Python (ROLE=web)    │         │
│  │ Port 3000        │      │ Port 8000            │         │
│  │ 153+ API routes  │      │ Brain systems        │         │
│  └──────────────────┘      └──────────────────────┘         │
│           │                           │                     │
└───────────┼───────────────────────────┼─────────────────────┘
            │                           │
            ▼                           ▼
     ┌─────────────┐             ┌─────────────┐
     │     S3      │             │   Neon DB   │
     └─────────────┘             └─────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    api-service (Backend)                    │
│  ┌───────────────────────────────────────┐                  │
│  │ Python (ROLE=api)                     │                  │
│  │ Port 8000                             │                  │
│  │ LLM processing, embeddings, reasoning │                  │
│  └───────────────────────────────────────┘                  │
│                      │                                      │
└──────────────────────┼──────────────────────────────────────┘
                       │
                       ▼
                ┌─────────────┐     ┌─────────────┐
                │   Upstash   │     │   Neon DB   │
                │    Redis    │     └─────────────┘
                └─────────────┘
Environment Variables
**Critical secrets** (managed via atom-cli secrets):
# Database
DATABASE_URL=postgresql://...
# Authentication
NEXTAUTH_SECRET=...
NEXTAUTH_URL=https://[tenant].atomagentos.com
# LLM Providers (BYOK)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
# External Services
REDIS_URL=https://*.upstash.io
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
STRIPE_SECRET_KEY=sk_live_...
# Email (SES)
SES_AWS_ACCESS_KEY_ID=...
SES_AWS_SECRET_ACCESS_KEY=...
SES_REGION=us-east-1
Health Check Endpoints
**Main App (web-platform):**
- GET /api/health - Full health check (DB, Redis, services)
- Health check interval: 15s
- Grace period: 30s
- Timeout: 10s
**Backend API (api-service):**
- GET /alive - Simple liveness (no DB required)
- GET /health - Full health check (DB, Redis, LLM providers)
- Health check interval: 30s
- Grace period: 90s
- Timeout: 10s
---
Deployment Procedures
Prerequisites Checklist
Before deploying to production, verify:
- [ ] All tests passing locally (npm test && cd backend-saas && pytest)
- [ ] No critical security vulnerabilities (npm audit --audit-level=high)
- [ ] Database migrations tested locally (alembic upgrade head)
- [ ] Environment variables documented
- [ ] Staging environment validated (if available)
- [ ] Backup created before major changes
- [ ] Team notified of deployment
- [ ] Rollback plan documented
Deployment: Main App (web-platform)
**Standard Deployment:**
# From repository root
atom-cli deploy
# With specific Dockerfile
atom-cli deploy --dockerfile Dockerfile
# Check deployment status
atom-cli status
**Deployment Process:**
- Code pushed to main branch
- atom-cli deploy triggers build
- Depot builder creates Docker image (cached layers)
- Release command runs migrations (./backend-saas/scripts/run_migrations.sh)
- Rolling deployment updates machines (zero downtime)
- Health checks validate service availability
- New version receives production traffic
**Expected Duration:** 3-5 minutes
**What Happens During Deployment:**
- Docker image built (cached layers speed this up)
- Database migrations run automatically
- Next.js frontend builds (production optimized)
- Python backend starts with ROLE=web
- Health checks validate all services
- Old machines replaced one-by-one (rolling update)
Deployment: Backend API (api-service)
**Standard Deployment:**
# From backend-saas directory
cd backend-saas
atom-cli deploy --config infrastructure.config
# Alternative from root
atom-cli deploy --dockerfile backend-saas/Dockerfile.api
**Deployment Process:**
- Code pushed to main branch
- atom-cli deploy triggers API-only build
- Docker image built (Dockerfile.api)
- Migrations run during startup (lifespan function)
- Rolling deployment to 2 machines
- Health checks validate Python backend
- New version receives API traffic
**Expected Duration:** 2-4 minutes
**Key Differences from Main App:**
- Uses Dockerfile.api (Python-only build)
- ROLE=api environment variable
- Migrations run in lifespan() (not release_command)
- Auto-stop when idle (cost optimization)
- 2 machines for rolling deployments
Post-Deployment Verification
After deployment completes, verify:
# 1. Check app status
atom-cli status
# 2. Verify health endpoints
curl https://[tenant].atomagentos.com/api/health
curl https://[tenant].atomagentos.com/api/alive
# 3. Check node status
atom-cli nodes list
# 4. View recent logs
atom-cli logs --lines 50
**Expected Results:**
- Health endpoints return 200 OK
- Machines show "running" state
- No errors in recent logs
- Critical paths functional (auth, agents, skills)
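The health-endpoint part of this verification can be scripted. The sketch below uses only the Python standard library; the helper names are hypothetical, and the example hostname is a placeholder for a real tenant subdomain. The 10-second default mirrors the health check timeout listed earlier in this runbook.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx)
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar
        return 0


def verify_endpoints(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each URL to True if it returned 200 OK."""
    return {url: fetch(url) == 200 for url in urls}


if __name__ == "__main__":
    base = "https://example-tenant.atomagentos.com"  # placeholder tenant
    results = verify_endpoints([f"{base}/api/health", f"{base}/api/alive"])
    for url, ok in results.items():
        print(("OK   " if ok else "FAIL ") + url)
```

The fetch function is injectable so the check can be exercised in tests without network access.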
Rollback Procedures
**Automatic Rollback (Health Check Failure):**
If health checks fail after deployment, the platform automatically rolls back to the previous version. No manual intervention required.
**Manual Rollback:**
# View deployment history
atom-cli deployments
# Rollback to specific version
atom-cli rollback <version>
**Database Rollback (if needed):**
# SSH into machine
atom-cli console
# Navigate to backend
cd /app
# Rollback last migration
alembic downgrade -1
# Rollback to specific revision
alembic downgrade <revision_id>
# Verify current revision
alembic current
**⚠️ WARNING:** Database rollbacks can cause data loss if the migration involved data changes. Always back up before rolling back.
Zero-Downtime Deployment Strategy
**Current Setup:**
- Rolling deployments (one machine at a time)
- Health check grace period (30s main, 90s API)
- Minimum machines running (1 main, 1 API)
**Best Practices:**
- Deploy during low-traffic hours when possible
- Monitor health checks during deployment
- Have rollback plan ready
- Test migrations locally first
- Use feature flags for major changes
---
Monitoring & Observability
Key Metrics to Monitor
Application-Level Metrics
**Request Metrics:**
- Request rate (requests per second)
- Response times (p50, p95, p99)
- Error rate (4xx, 5xx)
- Throughput (requests per minute)
**Target Thresholds:**
- p95 response time: < 2s (100 concurrent users)
- Error rate: < 1%
- Request rate: Scale up if sustained > 100 req/s
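To make the p95 target concrete, here is a small sketch of how a percentile could be computed from raw response times using the nearest-rank method. The function names are illustrative assumptions; production dashboards typically compute this for you, and other percentile definitions (for example, interpolated) give slightly different values.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_p95_target(latencies_s: Sequence[float], target_s: float = 2.0) -> bool:
    """Check latencies (in seconds) against the p95 < 2s target above."""
    return percentile(latencies_s, 95) < target_s


if __name__ == "__main__":
    latencies = [0.12, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.8, 1.1, 2.6]
    print(percentile(latencies, 50), percentile(latencies, 95))
    print(meets_p95_target(latencies))
```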
**Business Metrics:**
- Agent execution rate (agents per hour)
- Graduation exam success rate (%)
- Active agents count
- Tenant activity (daily active tenants)
Infrastructure Metrics
**ATOM Cloud Metrics:**
- CPU usage (%)
- Memory usage (%)
- Disk usage (%)
- Network in/out (bytes per second)
**Target Thresholds:**
- CPU usage: Alert if > 80% for 5 minutes
- Memory usage: Alert if > 85% for 5 minutes
- Disk usage: Alert if > 90%
**Database Metrics (Neon PostgreSQL):**
- Connection pool usage (%)
- Query performance (slow queries > 1s)
- Database size (GB)
- Transaction rate (tx per second)
**Target Thresholds:**
- Connection pool: Alert if > 80%
- Slow queries: Investigate if > 10 per minute
- Database size: Alert if > 90% of quota
**Redis Metrics (Upstash):**
- Hit rate (%)
- Memory usage (%)
- Command rate (commands per second)
- Connection count
**Target Thresholds:**
- Hit rate: > 80% (indicates effective caching)
- Memory usage: Alert if > 90%
LLM Provider Metrics
**OpenAI API:**
- Request latency (p50, p95)
- Error rate (4xx, 5xx)
- Rate limit hits (429 responses)
- Token usage (tokens per day)
**Target Thresholds:**
- Request latency: < 5s p95
- Error rate: < 2%
- Rate limit hits: Alert if > 10 per minute
Monitoring Dashboards
**ATOM Cloud Console:**
- URL: https://console.atomagentos.com
- Metrics: CPU, memory, network, requests
- Logs: Real-time log streaming
- Machines: Machine status and health
**Neon Console:**
- Database metrics and performance
- Slow query analysis
- Connection pool monitoring
**Upstash Console:**
- Redis metrics and hit rate
- Memory usage and commands
- Connection monitoring
Log Aggregation
# View real-time logs
atom-cli logs
# View last N lines
atom-cli logs --lines 100
# Follow logs (tail -f)
atom-cli logs --tail
**Log Levels:**
- INFO - Normal operations (startup, requests)
- WARNING - Non-critical issues (rate limits, retries)
- ERROR - Errors (exceptions, failed requests)
- CRITICAL - Critical failures (crashes, data loss)
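When triaging, a quick tally of log lines per level can show whether errors are spiking. The sketch below assumes each line starts with the level name followed by a colon, as in the sample patterns in this section; the function name is hypothetical and real log output may carry timestamps or other prefixes that would need a different pattern.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Assumed line format: "<LEVEL>: message"
LEVEL_RE = re.compile(r"^(INFO|WARNING|ERROR|CRITICAL):")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per level, ignoring lines with no recognized level."""
    counts: Counter = Counter()
    for line in lines:
        match = LEVEL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Example: atom-cli logs --lines 100 | python count_levels.py
    for level, count in count_levels(sys.stdin).most_common():
        print(f"{level}: {count}")
```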
**Common Log Patterns:**
**Successful Request:**
INFO: 10.0.0.1:12345 - "GET /api/agents HTTP/1.1" 200 OK
INFO: Request completed in 123ms
**Rate Limit:**
WARNING: Rate limit exceeded for tenant <tenant_id>
WARNING: 429 Too Many Requests
**Database Error:**
ERROR: Database connection failed
ERROR: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect
**LLM Provider Error:**
ERROR: OpenAI API request failed
ERROR: openai.error.RateLimitError: Rate limit exceeded
Alert Thresholds
Monitoring is performed via the **IntegrationMetrics** system, which enqueues on-demand evaluation tasks to **QStash**.
**Critical Alerts (Immediate Action Required):**
| Metric | Threshold | Duration | Action |
|---|---|---|---|
| App health check | > 50% failures | 1 minute | Investigate, restart machines |
| Database connection | > 90% pool usage | 2 minutes | Check for connection leaks |
| Error rate | > 10% | 2 minutes | Check logs, identify root cause |
| CPU usage | > 90% | 5 minutes | Scale up or investigate |
| Memory usage | > 95% | 5 minutes | Scale up or restart |
| Disk usage | > 95% | 5 minutes | Clean up or scale storage |
**Warning Alerts (Monitor Closely):**
| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Response time | > 3s p95 | 5 minutes | Investigate slow queries |
| Error rate | > 5% | 5 minutes | Check logs for patterns |
| CPU usage | > 80% | 10 minutes | Prepare to scale |
| Memory usage | > 85% | 10 minutes | Monitor, prepare to scale |
| Redis hit rate | < 70% | 15 minutes | Review caching strategy |
**Informational Alerts (Track Metrics):**
| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Agent execution rate | < 10/hour | 1 hour | Business as usual |
| Graduation exam rate | < 5/hour | 1 hour | Business as usual |
| Daily active tenants | < 5 | 1 day | Review engagement |
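The critical and warning tables above pair a threshold with a duration. A minimal sketch of that evaluation logic, covering three of the listed metrics, might look like the following. The rule structure and function name are assumptions for illustration and do not describe how IntegrationMetrics is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Rule:
    threshold: float  # percent; the alert fires when the value is above this
    minutes: float    # how long the value must stay above the threshold


# Values copied from the critical and warning tables above
RULES: Dict[str, Dict[str, Rule]] = {
    "cpu": {"critical": Rule(90, 5), "warning": Rule(80, 10)},
    "memory": {"critical": Rule(95, 5), "warning": Rule(85, 10)},
    "error_rate": {"critical": Rule(10, 2), "warning": Rule(5, 5)},
}


def classify(metric: str, value: float, sustained_minutes: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a sustained metric value."""
    rules = RULES[metric]
    for level in ("critical", "warning"):  # check the most severe level first
        rule = rules[level]
        if value > rule.threshold and sustained_minutes >= rule.minutes:
            return level
    return None


if __name__ == "__main__":
    print(classify("cpu", 92, 5))
    print(classify("memory", 90, 12))
    print(classify("error_rate", 3, 30))
```

Note that a short spike above the critical threshold does not alert under this logic until it has lasted for the rule's duration.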
Monitoring Tools
**Built-in Tools:**
- Cloud Console (metrics, logs, nodes)
- ATOM Cloud CLI (atom-cli commands)
- Neon console (database metrics)
- Upstash console (Redis metrics)
**External Tools (Optional):**
- Sentry (error tracking)
- Datadog (APM and metrics)
- Grafana (custom dashboards)
- PagerDuty (on-call routing)
---
Incident Response
Incident Severity Levels
**SEV-0 (Critical):**
- Definition: Complete service outage or data loss
- Impact: All users affected
- Response Time: Immediate (< 5 minutes)
- Examples: All machines down, database unavailable, data corruption
**SEV-1 (High):**
- Definition: Major feature degradation or partial outage
- Impact: Many users affected, critical paths broken
- Response Time: < 15 minutes
- Examples: Agent execution failing, auth broken, payment processing down
**SEV-2 (Medium):**
- Definition: Minor feature degradation or performance issues
- Impact: Some users affected, workarounds available
- Response Time: < 1 hour
- Examples: Slow response times, non-critical integration down, UI bugs
**SEV-3 (Low):**
- Definition: Cosmetic issues or edge cases
- Impact: Few users affected, no business impact
- Response Time: < 4 hours
- Examples: Typos, minor UI glitches, documentation errors
Incident Response Process
**1. Detection (Alert Received)**
- Alert triggered via monitoring
- PagerDuty/notification sent
- On-call engineer acknowledges
**2. Assessment (Understand Impact)**
- Check dashboards for metrics
- Review logs for errors
- Determine severity level
- Identify affected users
**3. Mitigation (Stop the Bleeding)**
- Implement temporary fix
- Restore service if possible
- Communicate status to users
- Document actions taken
**4. Resolution (Fix Root Cause)**
- Implement permanent fix
- Test in staging
- Deploy to production
- Verify fix works
**5. Post-Mortem (Learn and Improve)**
- Document incident timeline
- Identify root cause
- Create action items
- Update runbook if needed
Common Incidents & Playbooks
Incident 1: Database Connection Failures
**Symptoms:**
- 500 errors on all endpoints
- Logs show "could not connect to server"
- Health checks failing
**Detection:**
# Check health endpoint
curl https://[tenant].atomagentos.com/api/health
# View logs for database errors
atom-cli logs | grep -i "database\|connection"
**Mitigation:**
# 1. Check DATABASE_URL secret
atom-cli secrets list
# 2. Test database connection
atom-cli console
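# (Runs SELECT 1; assumes core.database.engine is a synchronous SQLAlchemy engine)
python -c "from sqlalchemy import text; from core.database import engine; print(engine.connect().execute(text('SELECT 1')).scalar())"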
# 3. Restart node (connection pool leak)
atom-cli nodes restart <node-id>
# 4. Scale up (connection exhaustion)
atom-cli scale --count 2
**Resolution:**
- If connection leak: Fix in code (ensure connections closed)
- If pool exhausted: Increase pool_size or scale app
- If database issue: Check Neon status page
**Prevention:**
- Enable connection pool monitoring
- Set connection timeout values
- Use connection pooling properly
- Regular restarts during maintenance
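Before increasing pool_size or scaling out, it helps to estimate the worst-case number of connections the app can open, because each additional machine brings its own pool. The sketch below assumes every worker process holds one SQLAlchemy QueuePool (whose defaults are pool_size=5 and max_overflow=10); the function names and the database limit in the example are hypothetical, so check the actual limit for your Neon plan.

```python
def max_db_connections(
    machines: int,
    workers_per_machine: int = 1,
    pool_size: int = 5,
    max_overflow: int = 10,
) -> int:
    """Worst-case connections the app can open against PostgreSQL."""
    return machines * workers_per_machine * (pool_size + max_overflow)


def fits_database_limit(total: int, db_limit: int, headroom: float = 0.8) -> bool:
    """True if the worst case stays at or below 80% of the database limit."""
    return total <= db_limit * headroom


if __name__ == "__main__":
    total = max_db_connections(machines=2, workers_per_machine=2)
    db_limit = 100  # hypothetical connection limit, not a documented value
    print(total, fits_database_limit(total, db_limit))
```

If the estimate does not fit, scaling out would move the exhaustion from the application pool to the database itself, so a smaller pool per worker or a pooled connection string may be the better fix.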
Incident 2: High Error Rates (> 10%)
**Symptoms:**
- Spike in 500 errors
- User reports of failures
- Error rate alert triggered
**Detection:**
# View error logs
atom-cli logs | grep "ERROR"
# Check recent deployments
atom-cli deployments
# View node status
atom-cli status
**Mitigation:**
# 1. Check if recent deployment caused issue
atom-cli deployments
# Rollback if needed:
atom-cli rollback <version>
# 2. Restart affected node
atom-cli nodes restart <node-id>
# 3. Scale up if resource issue
atom-cli scale --cpu 2 --memory 2048
# 4. Check for downstream dependencies
# (LLM providers, Redis, database)
**Resolution:**
- Identify root cause from logs
- Fix code issue and deploy
- Update runbook if new issue
- Add monitoring if needed
**Prevention:**
- Comprehensive testing before deploy
- Staging environment validation
- Gradual rollout (feature flags)
- Monitor metrics after deploy
Incident 3: Slow Response Times (> 3s p95)
**Symptoms:**
- User complaints about slowness
- Response time alert triggered
- Dashboard shows elevated latency
**Detection:**
# View recent logs with timing
atom-cli logs --service api --lines 100 | grep "Request completed"
# Check database for slow queries
atom-cli console
psql "$DATABASE_URL" -c "SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
# Check CPU/memory
atom-cli status
**Mitigation:**
# 1. Scale up (resource constraint)
atom-cli scale --cpu 2 --memory 2048
# 2. Restart node (memory leak)
atom-cli nodes restart <node-id>
# 3. Check database connection pool
# (May need to increase pool_size)
# 4. Check for long-running queries
# (Kill or optimize slow queries)
**Resolution:**
- Identify slow queries and optimize
- Add database indexes if needed
- Implement caching for expensive operations
- Optimize LLM calls (reduce tokens, cache results)
**Prevention:**
- Regular performance monitoring
- Query performance testing
- Caching strategy
- Load testing before major changes
Incident 4: LLM Provider Outage
**Symptoms:**
- Agent execution failing
- OpenAI/Anthropic API errors
- 500 errors on AI-dependent endpoints
**Detection:**
# View logs for LLM errors
atom-cli logs | grep -i "openai\|anthropic\|llm"
# Test LLM provider status
curl https://status.openai.com/
curl https://status.anthropic.com/
**Mitigation:**
# 1. Check API keys (may have expired)
atom-cli secrets list | grep -i "api_key"
# 2. Switch to backup provider
# (Route LLM calls to Anthropic, using ANTHROPIC_API_KEY instead of OPENAI_API_KEY)
# 3. Disable AI features temporarily
# (Set feature flag to skip LLM calls)
# 4. Use cached responses if available
# (Redis cache may have recent results)
**Resolution:**
- Wait for provider to restore service
- Implement fallback providers in code
- Add retry logic with exponential backoff
- Cache LLM responses to reduce dependency
**Prevention:**
- Implement multiple LLM providers (BYOK)
- Add caching for LLM responses
- Implement graceful degradation
- Monitor provider status pages
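The retry and fallback recommendations above can be combined in a small wrapper. This is a sketch under stated assumptions: each provider is represented as a zero-argument callable (hypothetical; real code would wrap the OpenAI or Anthropic client call), and every exception is treated as retryable, which real code should narrow to timeouts, 429s, and 5xx errors.

```python
import random
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")


def call_with_fallback(providers: Sequence[Callable[[], T]], **retry_options) -> T:
    """Try each provider in order; move on once one exhausts its retries."""
    last_error = None
    for provider in providers:
        try:
            return call_with_backoff(provider, **retry_options)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all LLM providers failed") from last_error
```

The sleep function is a parameter so the backoff schedule can be tested without waiting; the jitter spreads retries from many workers so they do not hit a recovering provider at the same moment.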
Incident 5: Redis Connection Errors
**Symptoms:**
- Rate limiting not working
- Session management failing
- Cache misses (100% miss rate)
- Logs show Redis connection errors
**Detection:**
# Test Redis connectivity
atom-cli console
curl $REDIS_URL/ping
# View Redis errors
atom-cli logs | grep -i "redis"
**Mitigation:**
# 1. Check REDIS_URL secret
atom-cli secrets list | grep REDIS
# 2. Test Redis directly
curl https://<redis-url>/ping
# 3. Restart app (may be connection pool issue)
atom-cli nodes restart <node-id>
# 4. Operate without Redis (degraded mode)
# (Rate limiting disabled, sessions in DB)
**Resolution:**
- If Upstash outage: Wait for service restore
- If connection leak: Fix in code
- If wrong URL: Update secret
- If quota exceeded: Upgrade Upstash plan
**Prevention:**
- Monitor Redis hit rate
- Test Redis connectivity in health checks
- Implement graceful degradation (work without Redis)
- Set connection timeouts
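Graceful degradation for the cache path can be as simple as treating any Redis failure as a cache miss. The sketch below is illustrative: the Cache protocol and cached_fetch are hypothetical names, and the cache object stands in for whatever Redis client wrapper the application uses.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Minimal interface this sketch expects from a Redis wrapper."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def cached_fetch(cache: Cache, key: str, load: Callable[[], str]) -> str:
    """Return the cached value, falling back to load() if the cache fails."""
    cache_ok = True
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception:
        cache_ok = False  # degraded mode: skip the cache for this request
    value = load()
    if cache_ok:
        try:
            cache.set(key, value)
        except Exception:
            pass  # a failed write must not fail the request
    return value
```

This pattern suits caching only. Rate limiting and sessions need an explicit decision about what to do when Redis is down, such as the database-backed sessions mentioned in the mitigation steps.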
Incident 6: Memory Leaks (High Memory Usage)
**Symptoms:**
- Memory usage steadily increasing
- Machine restarts (OOM killer)
- Performance degradation over time
**Detection:**
# Check memory usage
atom-cli status
# View memory over time
atom-cli logs | grep "memory"
# Console access to check process memory
atom-cli console
ps aux | grep python
**Mitigation:**
# 1. Restart node (temporary fix)
atom-cli nodes restart <node-id>
# 2. Scale up (more memory)
atom-cli scale --memory 2048
# 3. Schedule regular restarts
# (Cron job to restart machines daily)
**Resolution:**
- Identify memory leak source (profiling)
- Fix in code (unclosed connections, large objects)
- Implement memory limits (ulimit)
- Add memory monitoring alerts
**Prevention:**
- Regular memory profiling
- Load testing with memory monitoring
- Code reviews for memory management
- Automated restarts (maintenance window)
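For the profiling step, Python's built-in tracemalloc module can show which source lines allocated the most memory between two points in time. The helper below is a sketch with a hypothetical name; in a long-running service you would more likely take snapshots some minutes apart than wrap a single function call.

```python
import tracemalloc
from typing import Callable, List


def top_allocations(workload: Callable[[], None], limit: int = 5) -> List[str]:
    """Run workload and report the source lines whose allocations changed most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first
    stats = after.compare_to(before, "lineno")
    return [str(stat) for stat in stats[:limit]]


if __name__ == "__main__":
    retained = []  # simulates a leak: objects that are never released

    def leaky() -> None:
        retained.extend(bytearray(1024) for _ in range(1000))

    for line in top_allocations(leaky):
        print(line)
```

tracemalloc adds overhead while tracing is active, so it is better suited to a staging machine or a short diagnostic window than to permanent use in production.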
Escalation Procedures
**When to Escalate:**
- **SEV-0 Incident:** Immediate escalation to senior engineering
- **Unknown issue:** Escalate after 30 minutes of troubleshooting
- **Customer impact:** Escalate immediately if enterprise customers affected
- **Data loss risk:** Escalate immediately, involve database team
**Escalation Contact Order:**
- **On-Call Engineer** (Initial response)
- **Senior DevOps Engineer** (If unresolved in 30 minutes)
- **Engineering Manager** (If customer impact)
- **CTO** (If SEV-0 or data loss risk)
**Communication Template:**
SUBJECT: [SEV-X] <Incident Title>
SEVERITY: SEV-0/1/2/3
STATUS: Investigating/Mitigated/Resolved
STARTED: <timestamp>
AFFECTED: <users/services>
CURRENT IMPACT: <description>
CURRENT STATUS:
<What's happening now>
MITIGATION STEPS:
<What we're doing>
NEXT UPDATE: <timestamp>
---
Common Issues & Resolutions
Deployment Issues
Issue 1: Build Failures
**Symptoms:**
ERROR: failed to calculate checksum: "/requirements.txt": not found
**Resolution:**
- Check .dockerignore has !requirements*.txt at END
- Verify Dockerfile paths match build context
- Try atom-cli deploy without cache
**Reference:** DEPLOYMENT_TROUBLESHOOTING.md
Issue 2: Migration Failures
**Symptoms:**
Error: release command failed - aborting deployment
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable)
**Resolution:**
- Set release_command = "" in infrastructure.config
- Migrations will run in lifespan() instead
- Or make migrations idempotent (check if exists)
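The "check if exists" idea is shown below with the standard-library sqlite3 module as a stand-in for PostgreSQL, so the example runs anywhere. The table name is hypothetical. In an Alembic migration the same guard would typically be written with SQLAlchemy's inspector, for example sqlalchemy.inspect(op.get_bind()).has_table(...), before calling op.create_table.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if a table with this name already exists."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def upgrade(conn: sqlite3.Connection) -> None:
    """Idempotent migration step: safe to run more than once."""
    if not table_exists(conn, "agent_episodes"):  # hypothetical table
        conn.execute(
            "CREATE TABLE agent_episodes ("
            "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL)"
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    upgrade(conn)
    upgrade(conn)  # second run is a no-op instead of a DuplicateTable error
    print(table_exists(conn, "agent_episodes"))
```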
Issue 3: Health Check Failures
**Symptoms:**
WARNING The app is not listening on the expected address
**Resolution:**
- Check app binds to 0.0.0.0 (not 127.0.0.1)
- Verify port matches infrastructure.config internal_port
- Increase grace_period in infrastructure.config
- Check for startup errors in logs
Runtime Issues
Issue 4: Machine Auto-Stops
**Symptoms:**
- atom-saas-api machines stopped
- API returns 503/404
- Machines show "stopped" status
**Resolution:**
# Start machine manually
atom-cli nodes start <id>
# Or trigger by making API request
curl https://[tenant].atomagentos.com/alive
# Disable auto-stop (if needed)
atom-cli scale --min 2
Issue 5: Rate Limiting Errors
**Symptoms:**
WARNING: Rate limit exceeded for tenant <tenant_id>
429 Too Many Requests
**Resolution:**
- Check if tenant exceeded plan quota
- Upgrade tenant plan if needed
- Check if Redis is working (rate limiting requires Redis)
- Reset quota if legitimate issue
Issue 6: Agent Execution Failures
**Symptoms:**
- Agent execution returns 500
- Logs show governance errors
- Episodes not being recorded
**Resolution:**
- Check agent maturity level vs action complexity
- Verify agent governance cache (may need restart)
- Check LLM provider status
- Review agent configuration
Database Issues
Issue 7: Connection Pool Exhaustion
**Symptoms:**
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection pool exhausted
**Resolution:**
# 1. Restart app (frees connections)
atom-cli nodes restart <id>
# 2. Scale up (more connections)
atom-cli scale --count 2
# 3. Increase pool_size in code
# (Edit database.py and redeploy)
Issue 8: Slow Query Performance
**Symptoms:**
- Database queries > 1s
- API endpoints slow
- Logs show slow query warnings
**Resolution:**
# 1. Identify slow queries
atom-cli console
# Check Neon console for slow query log
# 2. Add indexes
alembic revision -m "add indexes"
# Edit migration to add indexes
# 3. Optimize query
# (Use selectinload for related rows, add pagination, etc.)
Integration Issues
Issue 9: OAuth Callback Failures
**Symptoms:**
- OAuth redirects fail
- Token storage errors
- Integration state not updating
**Resolution:**
- Check callback URL matches Cloud app URL
- Verify OAuth client ID/secret secrets
- Check tenant isolation in integration tables
- Review integration logs
Issue 10: Stripe Webhook Failures
**Symptoms:**
- Webhook returns 500
- Subscription events not processed
- Billing not updated
**Resolution:**
- Verify Stripe webhook secret
- Check webhook signature validation
- Test webhook endpoint with Stripe CLI
- Review tenant_id extraction
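When debugging signature validation, it helps to know what is being checked. Stripe sends a Stripe-Signature header containing a timestamp (t=...) and one or more v1 signatures, where each signature is an HMAC-SHA256 of "timestamp.payload" keyed with the webhook signing secret. The standard-library sketch below reproduces that check for diagnostic purposes; the function name is hypothetical, and production code should normally rely on the official Stripe library (stripe.Webhook.construct_event) rather than a hand-rolled verifier.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Check a Stripe-Signature header against the raw request body."""
    items = [part.split("=", 1) for part in header.split(",") if "=" in part]
    timestamp = next((v for k, v in items if k.strip() == "t"), None)
    candidates = [v for k, v in items if k.strip() == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False  # too old or too far in the future: possible replay
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

A common cause of 500s here is verifying against a parsed and re-serialized JSON body; the signature covers the raw bytes exactly as received.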
---
Data Backup & Recovery
Backup Strategy
**Database Backups (Neon PostgreSQL):**
- **Automated:** Neon provides continuous backups
- **Retention:** 7 days (point-in-time recovery available)
- **Frequency:** Continuous (WAL logs)
- **Location:** Neon-managed storage
**Storage Backups (AWS S3):**
- **Automated:** S3 versioning enabled
- **Retention:** 30 days
- **Frequency:** Per object upload
- **Location:** Same region as S3 bucket
**Redis Backups (Upstash):**
- **No automatic backups** (ephemeral cache)
- **Data can be rebuilt from database**
- **Critical:** Rate limits, sessions (can be recreated)
Backup Verification
**Weekly Backup Checks:**
# 1. List recent backups (Neon console)
# Navigate to: Neon Console > Database > Backups
# 2. Test point-in-time recovery
# (Create clone database from backup)
# 3. Verify S3 versioning
aws s3api list-object-versions --bucket atom-saas
# 4. Check Redis persistence
# (No backups - data is cache only)
Recovery Procedures
Database Recovery
**Scenario 1: Restore from Backup**
# 1. Identify backup timestamp
# (Neon Console > Backups)
# 2. Create recovery database
# (Neon Console > Create Branch > Point in Time)
# 3. Update DATABASE_URL secret
atom-cli secrets set DATABASE_URL=<new-url>
# 4. Restart app to use new database
atom-cli nodes restart <id>
# 5. Verify data integrity
curl https://[tenant].atomagentos.com/api/v1/health
**Scenario 2: Rollback Migration**
# 1. Access console
atom-cli console
# 2. Navigate to app directory
cd /app
# 3. Rollback last migration
alembic downgrade -1
# 4. Verify current revision
alembic current
# 5. Exit and restart node
exit
atom-cli nodes restart <id>
Storage Recovery (S3)
**Scenario 1: Restore Deleted Object**
# 1. List object versions
aws s3api list-object-versions \
--bucket atom-saas \
--prefix "tenant-abc/file.pdf"
# 2. Restore specific version
aws s3api get-object \
--bucket atom-saas \
--key "tenant-abc/file.pdf" \
--version-id <version-id> \
restored-file.pdf
# 3. Upload restored object
aws s3 cp restored-file.pdf \
s3://atom-saas/tenant-abc/file.pdf
Redis Recovery (Cache Rebuild)
**Scenario 1: Redis Cache Cleared**
# 1. Redis data is cache-only (no recovery needed)
# Data will be rebuilt on next request
# 2. Warm up critical caches
# (Trigger API calls to rebuild cache)
# 3. Monitor hit rate
# (Should improve over time)
Disaster Recovery
**Complete Site Failure:**
**Scenario:** All Cloud Nodes down, data center outage
**Recovery Steps:**
- **Assess Impact**
  - Check Cloud Status page
  - Determine scope of outage
- **Restore Database**
  - Create new database from backup
  - Update DATABASE_URL secret
- **Redeploy App**
- **Restore S3 Data**
  - S3 is separate (likely unaffected)
  - Verify S3 connectivity
- **Verify Services**
  - Test health endpoints
  - Smoke test critical paths
  - Monitor metrics
**RTO (Recovery Time Objective):** 2-4 hours
**RPO (Recovery Point Objective):** 5 minutes (Neon continuous backups)
---
Security & Compliance
Security Layers
- **Multi-Tenancy Isolation**
  - Row-Level Security (RLS) on all tables
  - Tenant_id required for all queries
  - Subdomain-based tenant routing
- **Authentication & Authorization**
  - NextAuth.js for session management
  - Role-based access control (RBAC)
  - Agent maturity-based permissions
- **Network Security**
  - HTTPS enforced (TLS 1.2+)
  - CORS configured for allowed origins
  - Rate limiting (AbuseProtectionService)
- **Data Security**
  - Encrypted at rest (Neon, S3)
  - Encrypted in transit (TLS)
  - Tenant API keys isolated
- **Application Security**
  - Input validation (Pydantic schemas)
  - SQL injection prevention (SQLAlchemy)
  - XSS prevention (React escaping)
Security Monitoring
**Daily Checks:**
- Review error logs for security issues
- Check for failed auth attempts
- Monitor rate limit violations
**Weekly Checks:**
- Review access logs for anomalies
- Audit tenant permission changes
- Check for new vulnerabilities
**Monthly Checks:**
- Run security scans (npm audit, pip-audit)
- Review third-party dependencies
- Update runbook with new threats
Security Incidents
**Incident Types:**
- **Unauthorized Access**
  - Symptoms: Suspicious login attempts, data breaches
  - Response: Revoke sessions, force password reset
  - Prevention: MFA, rate limiting, audit logging
- **Data Exposure**
  - Symptoms: Sensitive data in logs, unauthorized queries
  - Response: Rotate secrets, audit logs
  - Prevention: Log redaction, query validation
- **DDoS Attack**
  - Symptoms: Spike in requests, rate limit alerts
  - Response: Enable Cloud DDoS protection
  - Prevention: Rate limiting, CAPTCHA
Compliance
**GDPR Compliance:**
- Right to erasure: /api/users/[id]/delete endpoint
- Data export: /api/users/[id]/export endpoint
- Consent management: Tenant settings
**SOC 2 Compliance:**
- Audit logging: All actions logged
- Access controls: RBAC enforced
- Data encryption: At rest and in transit
- Incident response: Documented procedures
---
Maintenance Windows
Scheduled Maintenance
**Weekly Maintenance (Sundays 2-4 AM UTC):**
- Database maintenance (Neon)
- Machine restarts (memory leaks)
- Log cleanup
- Backup verification
**Monthly Maintenance (First Sunday 2-6 AM UTC):**
- Dependency updates
- Security patches
- Performance optimization
- Runbook updates
**Quarterly Maintenance:**
- Major version upgrades
- Architecture review
- Cost optimization
- Disaster recovery drill
Maintenance Process
**Before Maintenance:**
- Notify users 24 hours in advance
- Create backup (verify integrity)
- Set maintenance mode (if needed)
- Document rollback plan
**During Maintenance:**
- Execute maintenance tasks
- Verify services after changes
- Monitor metrics closely
- Have rollback ready
**After Maintenance:**
- Remove maintenance mode
- Smoke test critical paths
- Update runbook if changed
- Post-incident report (if issues)
---
Emergency Contacts
On-Call Rotation
**Primary On-Call:**
- **Name:** [On-Call Engineer]
- **Phone:** [Phone Number]
- **Email:** [Email]
- **Hours:** 24/7
**Escalation:**
- **Senior DevOps:** [Name, Phone, Email]
- **Engineering Manager:** [Name, Phone, Email]
- **CTO:** [Name, Phone, Email]
Service Providers
**ATOM Cloud Support:**
- **Status Page:** https://status.atomagentos.com
- **Support:** https://community.atomagentos.com
- **Docs:** https://docs.atomagentos.com
**Neon Database:**
- **Status Page:** https://status.neon.tech
- **Support:** support@neon.tech
- **Docs:** https://neon.tech/docs
**Upstash Redis:**
- **Status Page:** https://status.upstash.com
- **Support:** support@upstash.com
- **Docs:** https://upstash.com/docs
**AWS (S3, SES):**
- **Status Page:** https://status.aws.amazon.com
- **Support:** AWS Support Center
- **Docs:** https://docs.aws.amazon.com
**Stripe:**
- **Status Page:** https://status.stripe.com
- **Support:** https://support.stripe.com
- **Docs:** https://stripe.com/docs
Critical Services
**Monitoring & Alerting:**
- Cloud Console: https://console.atomagentos.com
- Neon Console: https://console.neon.tech
- Upstash Console: https://console.upstash.com
**Emergency Access:**
# Console access (emergency only)
atom-cli console
# Emergency restart
atom-cli nodes restart --all
# Emergency rollback
atom-cli rollback
---
Appendices
Appendix A: ATOM Cloud CLI Cheat Sheet
# Apps
atom-cli list
atom-cli status
atom-cli info
# Deployments
atom-cli deploy
atom-cli deployments
atom-cli rollback <version>
# Nodes
atom-cli nodes list
atom-cli nodes start <id>
atom-cli nodes stop <id>
atom-cli nodes restart <id>
# Logs
atom-cli logs
atom-cli logs --lines 100
atom-cli logs --tail
# Secrets
atom-cli secrets list
atom-cli secrets set KEY=value
atom-cli secrets unset KEY
# Console
atom-cli console
# Scaling
atom-cli scale --count 2
atom-cli scale --cpu 2 --memory 2048
# Regions
atom-cli regions list
atom-cli regions set iad,ewr
Appendix B: Database Commands
# Migrations
alembic upgrade head
alembic downgrade -1
alembic current
alembic history
alembic revision -m "description"
# Database connection
psql $DATABASE_URL
\dt # List tables
\d table_name # Describe table
\q # Quit
# Backup
pg_dump $DATABASE_URL > backup.sql
# Restore
psql $DATABASE_URL < backup.sql
Appendix C: Monitoring Queries
**Slow Queries (Neon Console):**
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
**Connection Count:**
SELECT count(*) FROM pg_stat_activity;
**Table Sizes:**
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
**Locks:**
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';
Appendix D: Runbook Maintenance
**Version History:**
- v1.0 (2026-02-22): Initial creation
- Future updates: Document changes here
**Update Process:**
- Make changes to this document
- Update version number and date
- Add summary of changes to version history
- Commit to repository
- Notify team of updates
**Review Schedule:**
- Monthly: Review for accuracy
- Quarterly: Major updates and improvements
- Annually: Complete rewrite if needed
---
**Document Owner:** DevOps Team
**Last Reviewed:** 2026-02-22
**Next Review:** 2026-03-22
---
**End of Production Runbook**